169 research outputs found

    ALGORITHMS AND HIGH PERFORMANCE COMPUTING APPROACHES FOR SEQUENCING-BASED COMPARATIVE GENOMICS

    As cost and throughput of second-generation sequencers continue to improve, even modestly resourced research laboratories can now perform DNA sequencing experiments that generate hundreds of billions of nucleotides of data, enough to cover the human genome dozens of times over, in about a week for a few thousand dollars. Such data are now being generated rapidly by research groups across the world, and large-scale analyses of these data appear often in high-profile publications such as Nature, Science, and The New England Journal of Medicine. But with these advances comes a serious problem: growth in per-sequencer throughput (currently about 4x per year) is drastically outpacing growth in computer speed (about 2x every 2 years). As the throughput gap widens over time, sequence analysis software is becoming a performance bottleneck, and the costs associated with building and maintaining the needed computing resources are burdensome for research laboratories. This thesis proposes two methods and describes four open source software tools that help to address these issues using novel algorithms and high-performance computing techniques. The proposed approaches build primarily on two insights. First, that the Burrows-Wheeler Transform and the FM Index, previously used for data compression and exact string matching, can be extended to facilitate fast and memory-efficient alignment of DNA sequences to long reference genomes such as the human genome. Second, that these algorithmic advances can be combined with MapReduce and cloud computing to solve comparative genomics problems in a manner that is scalable, fault tolerant, and usable even by small research groups.
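The throughput gap quoted in the abstract above can be worked through numerically. The sketch below only uses the two growth rates stated in the abstract (about 4x per year for sequencers, about 2x every 2 years for computers); the function name and parameters are illustrative assumptions, not part of the thesis.

```python
def throughput_gap(years, seq_growth_per_year=4.0, cpu_doubling_years=2.0):
    """Ratio of sequencer throughput growth to computer speed growth.

    Assumes both quantities grow exponentially at the rates quoted in the
    abstract and start from the same baseline (gap of 1.0 at year 0).
    """
    # Doubling every `cpu_doubling_years` years is 2**(1/cpu_doubling_years) per year.
    cpu_growth_per_year = 2.0 ** (1.0 / cpu_doubling_years)
    return (seq_growth_per_year / cpu_growth_per_year) ** years


if __name__ == "__main__":
    for n in range(0, 5):
        print(n, round(throughput_gap(n), 2))
```

Under these assumptions the gap grows by a factor of roughly 2.8 each year, so after two years sequencing output has grown 16-fold while computer speed has only doubled.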

    BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions

    DNA methylation is an important epigenetic modification involved in gene regulation, which can now be measured using whole-genome bisulfite sequencing. However, cost, complexity of the data, and lack of comprehensive analytical tools are major challenges that keep this technology from becoming widely applied. Here we present BSmooth, an alignment, quality control and analysis pipeline that provides accurate and precise results even with low coverage data, appropriately handling biological replicates. BSmooth is open source software, and can be downloaded from http://rafalab.jhsph.edu/bsmooth

    Highly Scalable Short Read Alignment with the Burrows-Wheeler Transform and Cloud Computing

    Improvements in DNA sequencing have both broadened its utility and dramatically increased the size of sequencing datasets. Sequencing instruments are now used regularly as sources of high-resolution evidence for genotyping, methylation profiling, DNA-protein interaction mapping, and characterizing gene expression in the human genome and in other species. With existing methods, the computational cost of aligning short reads from the Illumina instrument to a mammalian genome can be very large: on the order of many CPU months for one human genotyping project. This thesis presents a novel application of the Burrows-Wheeler Transform that enables the alignment of short DNA sequences to mammalian genomes at a rate much faster than existing hashtable-based methods. The thesis also presents an extension of the technique that exploits the scalability of Cloud Computing to perform the equivalent of one human genotyping project in hours
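To illustrate how a Burrows-Wheeler Transform index supports matching without a hashtable of the reference, here is a minimal sketch of exact-match backward search over an FM index. It is a toy under stated assumptions, not the thesis's implementation: the function names are hypothetical, the index is built naively by sorting suffixes (impractical for a mammalian genome), the text is assumed not to contain `$`, and mismatches and gaps are not handled.

```python
from collections import Counter


def build_fm_index(text):
    """Build a naive FM index (suffix array, C table, Occ table) for `text`."""
    text += "$"  # sentinel, assumed absent from the text and smaller than A/C/G/T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is the sentinel when i == 0
    counts = Counter(text)
    # first[c] = number of characters in the text strictly smaller than c
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, first, occ


def find(pattern, sa, first, occ):
    """Return sorted start positions of exact occurrences of `pattern`."""
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):  # extend the match one character to the left
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


if __name__ == "__main__":
    index = build_fm_index("GATTACATTAG")
    print(find("TTA", *index))
```

Each pattern character costs a constant number of table lookups, so the search time depends on the read length rather than on the size of the reference.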

    B-SOLANA: an approach for the analysis of two-base encoding bisulfite sequencing data

    Summary: Bisulfite sequencing, a combination of bisulfite treatment and high-throughput sequencing, has proved to be a valuable method for measuring DNA methylation at single base resolution. Here, we present B-SOLANA, an approach for the analysis of two-base encoding (colorspace) bisulfite sequencing data on the SOLiD platform of Life Technologies. It includes the alignment of bisulfite sequences and the determination of methylation levels in CpG as well as non-CpG sequence contexts. B-SOLANA enables a fast and accurate analysis of large raw sequence datasets
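As a rough illustration of what determining methylation levels per sequence context involves, the sketch below classifies a cytosine as CpG, CHG or CHH and estimates its methylation level as the fraction of reads that still show C after bisulfite conversion. This is a generic, forward-strand-only sketch with hypothetical function names; it does not reproduce B-SOLANA's colorspace handling or its actual calling procedure.

```python
def cytosine_context(seq, pos):
    """Classify the cytosine at `pos` as 'CpG', 'CHG', 'CHH' or 'unknown'."""
    if seq[pos] != "C":
        raise ValueError("position is not a cytosine")
    downstream = seq[pos + 1:pos + 3]  # up to two bases after the cytosine
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CpG"
    if len(downstream) == 2:
        return "CHG" if downstream[1] == "G" else "CHH"
    return "unknown"  # too close to the end of the sequence to decide


def methylation_level(c_reads, t_reads):
    """Fraction of reads showing C (unconverted) at a cytosine.

    Unmethylated cytosines are read as T after bisulfite treatment.
    Returns None when no read covers the site.
    """
    total = c_reads + t_reads
    return None if total == 0 else c_reads / total


if __name__ == "__main__":
    print(cytosine_context("ACGT", 1), methylation_level(3, 1))
```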

    Computational pan-genomics: status, promises and challenges

    Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example of a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.
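The shift from string references to graph references mentioned in the abstract above can be pictured with a very small example. The sketch below assumes a toy directed acyclic sequence graph in which each node carries a DNA segment and a single-base variant appears as two alternative nodes; the node labels, sequences and function name are invented for illustration and do not come from the article.

```python
# Toy sequence graph: node id -> DNA segment (all values are made up).
NODES = {1: "ACG", 2: "T", 3: "C", 4: "GGA"}
# Directed edges: node id -> successor node ids. Nodes 2 and 3 are alternatives.
EDGES = {1: [2, 3], 2: [4], 3: [4], 4: []}


def spell_paths(nodes, edges, node):
    """Return every sequence spelled by a path from `node` to a sink node."""
    if not edges[node]:
        return [nodes[node]]
    return [
        nodes[node] + tail
        for successor in edges[node]
        for tail in spell_paths(nodes, edges, successor)
    ]


if __name__ == "__main__":
    for haplotype in spell_paths(NODES, EDGES, 1):
        print(haplotype)
```

A single linear string could store only one of the two spelled sequences, whereas the graph stores the shared flanks once and both alleles side by side.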

    Virulence Regulator EspR of Mycobacterium tuberculosis Is a Nucleoid-Associated Protein

    The principal virulence determinant of Mycobacterium tuberculosis (Mtb), the ESX-1 protein secretion system, is positively controlled at the transcriptional level by EspR. Depletion of EspR reportedly affects a small number of genes, both positively and negatively, including a key ESX-1 component, the espACD operon. EspR is also thought to be an ESX-1 substrate. Using EspR-specific antibodies in ChIP-Seq experiments (chromatin immunoprecipitation followed by ultra-high throughput DNA sequencing) we show that EspR binds to at least 165 loci on the Mtb genome. Included in the EspR regulon are genes encoding not only EspA, but also EspR itself, the ESX-2 and ESX-5 systems, a host of diverse cell wall functions, such as production of the complex lipid PDIM (phenolthiocerol dimycocerosate) and the PE/PPE cell-surface proteins. EspR binding sites are not restricted to promoter regions and can be clustered. This suggests that rather than functioning as a classical regulatory protein EspR acts globally as a nucleoid-associated protein capable of long-range interactions consistent with a recently established structural model. EspR expression was shown to be growth phase-dependent, peaking in the stationary phase. Overexpression in Mtb strain H37Rv revealed that EspR influences target gene expression both positively and negatively, leading to growth arrest. At no stage was EspR secreted into the culture filtrate. Thus, rather than serving as a specific activator of a virulence locus, EspR is a novel nucleoid-associated protein, with both architectural and regulatory roles, that impacts cell wall functions and pathogenesis through multiple genes.

    Tuning transcription factor availability through acetylation-mediated genomic redistribution

    It is widely assumed that decreasing transcription factor DNA-binding affinity reduces transcription initiation by diminishing occupancy of sequence-specific regulatory elements. However, in vivo transcription factors find their binding sites while confronted with a large excess of low-affinity degenerate motifs. Here, using the melanoma lineage survival oncogene MITF as a model, we show that low-affinity binding sites act as a competitive reservoir in vivo from which transcription factors are released by mitogen-activated protein kinase (MAPK)-stimulated acetylation to promote increased occupancy of their regulatory elements. Consequently, a low-DNA-binding-affinity acetylation-mimetic MITF mutation supports melanocyte development and drives tumorigenesis, whereas a high-affinity non-acetylatable mutant does not. The results reveal a paradoxical acetylation-mediated molecular clutch that tunes transcription factor availability via genome-wide redistribution and couples BRAF to tumorigenesis. Our results further suggest that p300/CREB-binding protein-mediated transcription factor acetylation may represent a common mechanism to control transcription factor availability

    Mechanisms of stretch-mediated skin expansion at single-cell resolution.

    The ability of the skin to grow in response to stretching has been exploited in reconstructive surgery [1]. Although the response of epidermal cells to stretching has been studied in vitro [2,3], it remains unclear how mechanical forces affect their behaviour in vivo. Here we develop a mouse model in which the consequences of stretching on skin epidermis can be studied at single-cell resolution. Using a multidisciplinary approach that combines clonal analysis with quantitative modelling and single-cell RNA sequencing, we show that stretching induces skin expansion by creating a transient bias in the renewal activity of epidermal stem cells, while a second subpopulation of basal progenitors remains committed to differentiation. Transcriptional and chromatin profiling identifies how cell states and gene-regulatory networks are modulated by stretching. Using pharmacological inhibitors and mouse mutants, we define the step-by-step mechanisms that control stretch-mediated tissue expansion at single-cell resolution in vivo. Funding: Wellcome Trust; Royal Societ

    A Computational and Experimental Study of the Regulatory Mechanisms of the Complement System

    The complement system is key to innate immunity and its activation is necessary for the clearance of bacteria and apoptotic cells. However, insufficient or excessive complement activation will lead to immune-related diseases. It is so far unknown how the complement activity is up- or down-regulated and what the associated pathophysiological mechanisms are. To quantitatively understand the modulatory mechanisms of the complement system, we built a computational model involving the enhancement and suppression mechanisms that regulate complement activity. Our model consists of a large system of Ordinary Differential Equations (ODEs) accompanied by a dynamic Bayesian network as a probabilistic approximation of the ODE dynamics. Applying Bayesian inference techniques, this approximation was used to perform parameter estimation and sensitivity analysis. Our combined computational and experimental study showed that the antimicrobial response is sensitive to changes in pH and calcium levels, which determine the strength of the crosstalk between CRP and L-ficolin. Our study also revealed differential regulatory effects of C4BP. While C4BP delays but does not decrease the classical complement activation, it attenuates but does not significantly delay the lectin pathway activation. We also found that the major inhibitory role of C4BP is to facilitate the decay of C3 convertase. In summary, the present work elucidates the regulatory mechanisms of the complement system and demonstrates how the bio-pathway machinery maintains the balance between activation and inhibition. The insights we have gained could contribute to the development of therapies targeting the complement system. Funding: Singapore. Ministry of Education (Grant T208B3109); Singapore. Agency for Science, Technology and Research (BMRC 08/1/21/19/574); Singapore-MIT Alliance (Computational and Systems Biology Flagship Project); Swedish Research Counci